Main features of interest in your dataset?
Based on the above analysis the PropserScore, BorrowedAPR and the Loan Amount seem to be features of interest
## 'data.frame': 113937 obs. of 81 variables:
## $ ListingKey : Factor w/ 113066 levels "00003546482094282EF90E5",..: 7180 7193 6647 6669 6686 6689 6699 6706 6687 6687 ...
## $ ListingNumber : int 193129 1209647 81716 658116 909464 1074836 750899 768193 1023355 1023355 ...
## $ ListingCreationDate : Factor w/ 113064 levels "2005-11-09 20:44:28.847000000",..: 14184 111894 6429 64760 85967 100310 72556 74019 97834 97834 ...
## $ CreditGrade : Factor w/ 9 levels "","A","AA","B",..: 5 1 8 1 1 1 1 1 1 1 ...
## $ Term : int 36 36 36 36 36 60 36 36 36 36 ...
## $ LoanStatus : Factor w/ 12 levels "Cancelled","Chargedoff",..: 3 4 3 4 4 4 4 4 4 4 ...
## $ ClosedDate : Factor w/ 2803 levels "","2005-11-25 00:00:00",..: 1138 1 1263 1 1 1 1 1 1 1 ...
## $ BorrowerAPR : num 0.165 0.12 0.283 0.125 0.246 ...
## $ BorrowerRate : num 0.158 0.092 0.275 0.0974 0.2085 ...
## $ LenderYield : num 0.138 0.082 0.24 0.0874 0.1985 ...
## $ EstimatedEffectiveYield : num NA 0.0796 NA 0.0849 0.1832 ...
## $ EstimatedLoss : num NA 0.0249 NA 0.0249 0.0925 ...
## $ EstimatedReturn : num NA 0.0547 NA 0.06 0.0907 ...
## $ ProsperRating..numeric. : int NA 6 NA 6 3 5 2 4 7 7 ...
## $ ProsperRating..Alpha. : Factor w/ 8 levels "","A","AA","B",..: 1 2 1 2 6 4 7 5 3 3 ...
## $ ProsperScore : num NA 7 NA 9 4 10 2 4 9 11 ...
## $ ListingCategory..numeric. : int 0 2 0 16 2 1 1 2 7 7 ...
## $ BorrowerState : Factor w/ 52 levels "","AK","AL","AR",..: 7 7 12 12 25 34 18 6 16 16 ...
## $ Occupation : Factor w/ 68 levels "","Accountant/CPA",..: 37 43 37 52 21 43 50 29 24 24 ...
## $ EmploymentStatus : Factor w/ 9 levels "","Employed",..: 9 2 4 2 2 2 2 2 2 2 ...
## $ EmploymentStatusDuration : int 2 44 NA 113 44 82 172 103 269 269 ...
## $ IsBorrowerHomeowner : Factor w/ 2 levels "False","True": 2 1 1 2 2 2 1 1 2 2 ...
## $ CurrentlyInGroup : Factor w/ 2 levels "False","True": 2 1 2 1 1 1 1 1 1 1 ...
## $ GroupKey : Factor w/ 707 levels "","00343376901312423168731",..: 1 1 335 1 1 1 1 1 1 1 ...
## $ DateCreditPulled : Factor w/ 112992 levels "2005-11-09 00:30:04.487000000",..: 14347 111883 6446 64724 85857 100382 72500 73937 97888 97888 ...
## $ CreditScoreRangeLower : int 640 680 480 800 680 740 680 700 820 820 ...
## $ CreditScoreRangeUpper : int 659 699 499 819 699 759 699 719 839 839 ...
## $ FirstRecordedCreditLine : Factor w/ 11586 levels "","1947-08-24 00:00:00",..: 8639 6617 8927 2247 9498 497 8265 7685 5543 5543 ...
## $ CurrentCreditLines : int 5 14 NA 5 19 21 10 6 17 17 ...
## $ OpenCreditLines : int 4 14 NA 5 19 17 7 6 16 16 ...
## $ TotalCreditLinespast7years : int 12 29 3 29 49 49 20 10 32 32 ...
## $ OpenRevolvingAccounts : int 1 13 0 7 6 13 6 5 12 12 ...
## $ OpenRevolvingMonthlyPayment : num 24 389 0 115 220 1410 214 101 219 219 ...
## $ InquiriesLast6Months : int 3 3 0 0 1 0 0 3 1 1 ...
## $ TotalInquiries : num 3 5 1 1 9 2 0 16 6 6 ...
## $ CurrentDelinquencies : int 2 0 1 4 0 0 0 0 0 0 ...
## $ AmountDelinquent : num 472 0 NA 10056 0 ...
## $ DelinquenciesLast7Years : int 4 0 0 14 0 0 0 0 0 0 ...
## $ PublicRecordsLast10Years : int 0 1 0 0 0 0 0 1 0 0 ...
## $ PublicRecordsLast12Months : int 0 0 NA 0 0 0 0 0 0 0 ...
## $ RevolvingCreditBalance : num 0 3989 NA 1444 6193 ...
## $ BankcardUtilization : num 0 0.21 NA 0.04 0.81 0.39 0.72 0.13 0.11 0.11 ...
## $ AvailableBankcardCredit : num 1500 10266 NA 30754 695 ...
## $ TotalTrades : num 11 29 NA 26 39 47 16 10 29 29 ...
## $ TradesNeverDelinquent..percentage. : num 0.81 1 NA 0.76 0.95 1 0.68 0.8 1 1 ...
## $ TradesOpenedLast6Months : num 0 2 NA 0 2 0 0 0 1 1 ...
## $ DebtToIncomeRatio : num 0.17 0.18 0.06 0.15 0.26 0.36 0.27 0.24 0.25 0.25 ...
## $ IncomeRange : Factor w/ 8 levels "$0","$1-24,999",..: 4 5 7 4 3 3 4 4 4 4 ...
## $ IncomeVerifiable : Factor w/ 2 levels "False","True": 2 2 2 2 2 2 2 2 2 2 ...
## $ StatedMonthlyIncome : num 3083 6125 2083 2875 9583 ...
## $ LoanKey : Factor w/ 113066 levels "00003683605746079487FF7",..: 100337 69837 46303 70776 71387 86505 91250 5425 908 908 ...
## $ TotalProsperLoans : int NA NA NA NA 1 NA NA NA NA NA ...
## $ TotalProsperPaymentsBilled : int NA NA NA NA 11 NA NA NA NA NA ...
## $ OnTimeProsperPayments : int NA NA NA NA 11 NA NA NA NA NA ...
## $ ProsperPaymentsLessThanOneMonthLate: int NA NA NA NA 0 NA NA NA NA NA ...
## $ ProsperPaymentsOneMonthPlusLate : int NA NA NA NA 0 NA NA NA NA NA ...
## $ ProsperPrincipalBorrowed : num NA NA NA NA 11000 NA NA NA NA NA ...
## $ ProsperPrincipalOutstanding : num NA NA NA NA 9948 ...
## $ ScorexChangeAtTimeOfListing : int NA NA NA NA NA NA NA NA NA NA ...
## $ LoanCurrentDaysDelinquent : int 0 0 0 0 0 0 0 0 0 0 ...
## $ LoanFirstDefaultedCycleNumber : int NA NA NA NA NA NA NA NA NA NA ...
## $ LoanMonthsSinceOrigination : int 78 0 86 16 6 3 11 10 3 3 ...
## $ LoanNumber : int 19141 134815 6466 77296 102670 123257 88353 90051 121268 121268 ...
## $ LoanOriginalAmount : int 9425 10000 3001 10000 15000 15000 3000 10000 10000 10000 ...
## $ LoanOriginationDate : Factor w/ 1873 levels "2005-11-15 00:00:00",..: 426 1866 260 1535 1757 1821 1649 1666 1813 1813 ...
## $ LoanOriginationQuarter : Factor w/ 33 levels "Q1 2006","Q1 2007",..: 18 8 2 32 24 33 16 16 33 33 ...
## $ MemberKey : Factor w/ 90831 levels "00003397697413387CAF966",..: 11071 10302 33781 54939 19465 48037 60448 40951 26129 26129 ...
## $ MonthlyLoanPayment : num 330 319 123 321 564 ...
## $ LP_CustomerPayments : num 11396 0 4187 5143 2820 ...
## $ LP_CustomerPrincipalPayments : num 9425 0 3001 4091 1563 ...
## $ LP_InterestandFees : num 1971 0 1186 1052 1257 ...
## $ LP_ServiceFees : num -133.2 0 -24.2 -108 -60.3 ...
## $ LP_CollectionFees : num 0 0 0 0 0 0 0 0 0 0 ...
## $ LP_GrossPrincipalLoss : num 0 0 0 0 0 0 0 0 0 0 ...
## $ LP_NetPrincipalLoss : num 0 0 0 0 0 0 0 0 0 0 ...
## $ LP_NonPrincipalRecoverypayments : num 0 0 0 0 0 0 0 0 0 0 ...
## $ PercentFunded : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Recommendations : int 0 0 0 0 0 0 0 0 0 0 ...
## $ InvestmentFromFriendsCount : int 0 0 0 0 0 0 0 0 0 0 ...
## $ InvestmentFromFriendsAmount : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Investors : int 258 1 41 158 20 1 1 1 1 1 ...
Prosper Loan Dataset: The original Prosper Loan Dataset contains information of loans cetgorized across 81 variables. For the purpose of this analysis we are subsetting this dataset to analyse the data pertaining to 16 variables namely ‘CreditGrade’,‘Term’,‘LoanStatus’,‘BorrowerAPR’, ‘BorrowerRate’,‘ListingCategory..numeric.’,‘BorrowerState’,‘Occupation’, ‘EmploymentStatus’,‘EmploymentStatusDuration’,‘IsBorrowerHomeowner’, ‘IncomeRange’,‘LoanNumber’,‘LoanOriginalAmount’,‘LoanOriginationDate’ and ‘LoanOriginationQuarter’.
## 'data.frame': 113937 obs. of 16 variables:
## $ CreditGrade : Factor w/ 9 levels "","A","AA","B",..: 5 1 8 1 1 1 1 1 1 1 ...
## $ Term : int 36 36 36 36 36 60 36 36 36 36 ...
## $ LoanStatus : Factor w/ 12 levels "Cancelled","Chargedoff",..: 3 4 3 4 4 4 4 4 4 4 ...
## $ BorrowerAPR : num 0.165 0.12 0.283 0.125 0.246 ...
## $ ProsperScore : num NA 7 NA 9 4 10 2 4 9 11 ...
## $ ListingCategory..numeric.: int 0 2 0 16 2 1 1 2 7 7 ...
## $ BorrowerState : Factor w/ 52 levels "","AK","AL","AR",..: 7 7 12 12 25 34 18 6 16 16 ...
## $ Occupation : Factor w/ 68 levels "","Accountant/CPA",..: 37 43 37 52 21 43 50 29 24 24 ...
## $ EmploymentStatus : Factor w/ 9 levels "","Employed",..: 9 2 4 2 2 2 2 2 2 2 ...
## $ EmploymentStatusDuration : int 2 44 NA 113 44 82 172 103 269 269 ...
## $ IsBorrowerHomeowner : Factor w/ 2 levels "False","True": 2 1 1 2 2 2 1 1 2 2 ...
## $ IncomeRange : Factor w/ 8 levels "$0","$1-24,999",..: 4 5 7 4 3 3 4 4 4 4 ...
## $ LoanNumber : int 19141 134815 6466 77296 102670 123257 88353 90051 121268 121268 ...
## $ LoanOriginalAmount : int 9425 10000 3001 10000 15000 15000 3000 10000 10000 10000 ...
## $ LoanOriginationDate : Factor w/ 1873 levels "2005-11-15 00:00:00",..: 426 1866 260 1535 1757 1821 1649 1666 1813 1813 ...
## $ LoanOriginationQuarter : Factor w/ 33 levels "Q1 2006","Q1 2007",..: 18 8 2 32 24 33 16 16 33 33 ...
Understanding the range of values for each column:
CreditGrade:
levels(ld$CreditGrade)
## [1] "" "A" "AA" "B" "C" "D" "E" "HR" "NC"
Term in Months:
range(ld$Term)
## [1] 12 60
Term in Months Distribution:
table(ld$Term)
##
## 12 36 60
## 1614 87778 24545
Loan Status:
levels(ld$LoanStatus)
## [1] "Cancelled" "Chargedoff"
## [3] "Completed" "Current"
## [5] "Defaulted" "FinalPaymentInProgress"
## [7] "Past Due (1-15 days)" "Past Due (16-30 days)"
## [9] "Past Due (31-60 days)" "Past Due (61-90 days)"
## [11] "Past Due (91-120 days)" "Past Due (>120 days)"
Loan Status Distribution:
table(ld$LoanStatus)
##
## Cancelled Chargedoff Completed
## 5 11992 38074
## Current Defaulted FinalPaymentInProgress
## 56576 5018 205
## Past Due (1-15 days) Past Due (16-30 days) Past Due (31-60 days)
## 806 265 363
## Past Due (61-90 days) Past Due (91-120 days) Past Due (>120 days)
## 313 304 16
Borrower APR:
ld_without_NA <- subset(ld, !is.na(ld$BorrowerAPR))
range(ld_without_NA$BorrowerAPR)
## [1] 0.00653 0.51229
Loan Original Amount:
range(ld$LoanOriginalAmount)
## [1] 1000 35000
Employment status duration in months:
ld_without_NA <- subset(ld, !is.na(ld$EmploymentStatusDuration))
range(ld_without_NA$EmploymentStatusDuration)
## [1] 0 755
Prosper score for Loans:
ld_without_NA <- subset(ld, !is.na(ld$ProsperScore))
range(ld_without_NA$ProsperScore)
## [1] 1 11
table(ld$ProsperScore)
##
## 1 2 3 4 5 6 7 8 9 10 11
## 992 5766 7642 12595 9813 12278 10597 12053 6911 4750 1456
ggplot(aes(x = Term), data = ld) +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(aes(x = CreditGrade), data = ld) +
geom_histogram(stat = "count")
## Warning: Ignoring unknown parameters: binwidth, bins, pad
ggplot(aes(x = LoanStatus), data = ld) +
geom_histogram(stat = "count")
## Warning: Ignoring unknown parameters: binwidth, bins, pad
ggplot(aes(x = BorrowerAPR), data = ld) +
geom_histogram(binwidth = 0.005) +
scale_x_continuous(breaks = seq(0,0.5,0.05))
## Warning: Removed 25 rows containing non-finite values (stat_bin).
ggplot(aes(x = EmploymentStatus), data = ld) +
geom_histogram(stat = "count")
## Warning: Ignoring unknown parameters: binwidth, bins, pad
ggplot(aes(x = IncomeRange), data = ld) +
geom_histogram(stat = "count")
## Warning: Ignoring unknown parameters: binwidth, bins, pad
ggplot(aes(x = LoanOriginalAmount), data = ld) +
geom_histogram(binwidth = 1000) +
scale_x_continuous(breaks = seq(0,35000,2500))
ggplot(aes(x = ListingCategory..numeric.), data = ld) +
geom_histogram(stat = "count") +
scale_x_continuous(breaks = seq(0,20,1))
## Warning: Ignoring unknown parameters: binwidth, bins, pad
ggplot(aes(x = BorrowerState), data = ld) +
geom_histogram(stat = "count")
## Warning: Ignoring unknown parameters: binwidth, bins, pad
ggplot(aes(x = EmploymentStatusDuration), data = ld) +
geom_histogram(binwidth=30) +
scale_x_continuous(breaks = seq(0,600,30))
## Warning: Removed 7625 rows containing non-finite values (stat_bin).
ggplot(aes(x = ProsperScore), data = ld) +
geom_histogram() +
scale_x_continuous(breaks = seq(1,11,1))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 29084 rows containing non-finite values (stat_bin).
ggplot(aes(x = ProsperScore), data = subset(ld,!is.na(IsBorrowerHomeowner))) +
geom_histogram(aes(color = IsBorrowerHomeowner)) +
scale_x_continuous(breaks = seq(1,11,1))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 29084 rows containing non-finite values (stat_bin).
TERM HISTOGRAM ANALYSIS: It is observed that all the loans are either of 12,36 or 60 months duration. The maximum number of loans have a term of 36 months followed by 60 month term loans.
CREDIT GRADE HISTOGRAM ANALYSIS: It is observed that max loans have no credit grade assigned. Among the rest ‘C’ Credit Grade loans are marginally higher than the rest grades.
LOAN STATUS HISTOGRAM ANALYSIS: It is observed that approximately 5000 loans have defaulted status and maximum loans in the dataset are current. The pst due loans are comparitveley very less.
BORROWER APR HISTOGRAM ANALYSIS: It is observed that maximum loans have been issued at an APR of 36%.
EMPLOYMENT STATUS HISTOGRAM ANALYSIS: It is observed that maximum loans have been issued to Employed people and very few loans have been issued to not employed and part time people.
INCOME RANGE HISTOGRAM ANALYSIS: The histogram is a perfect bell curve and shows that max loans have been given to the middle income groups with $25K-49K and $50K-74.99K and the loans fall off on either side as the income rises and falls.
LOAN ORIGINAL AMOUNT HISTOGRAM ANALYSIS: The most frequently sanctioned Loan amounts seems to be 4K$, 10K and 15K dollars. There are also loans sanctioned for 35K dollars.
LISTING CATEGORY HISTOGRAM ANALYSIS: Maximum loans have been sanctioned listing catgeory 1(Debt Consolidation) followd by 0(Not Available) and 7(Other). Very few loans have been sanctioned to 8 - Baby&Adoption, 9 - Boat, 10 - Cosmetic Procedure, 11 - Engagement Ring, 12 - Green Loans followed by 18 - Taxes, 19 - Vacation, 20 - Wedding Loans
BORROWER STATE HISTOGRAM ANALYSIS: Max lons have been sanctioned to people from state of California followed by approximatley equal number of loans to people from Florida, Illinois, New York and Texas.
EMPLOYMENT STATUS DURATION HISTOGRAM ANALYSIS: It is observed that max loans have been given to people who have only spent a month or two in their employment status and there is a clear downward march of loans given out decreasing as the employment status duration increases.
PROSPER SCORE HISTOGRAM ANALYSIS: It is observed that max loans have a prosper score of 4,6 and 8 while there are comparitevely very few loans having the worst score of 1 and best score of 11. We also see that maximum loans have been given to non home owners as compared to home owners across all the prosper score ratings.
Based on the above analysis the PropserScore, BorrowedAPR and the Loan Amount seem to be features of interest
IsBorrowerHomeowner, Term and EmploymentStatusDuration are the features which could support the investigation.
The employment status duration histogram was unexpected as it showed that the loans granted to people decreased as their duration in the respective employment increased. It perhaps could be attributed to the fact that max loans were given out to new employees upon getting a job within the first two months.
table(ld$Term)
##
## 12 36 60
## 1614 87778 24545
## Warning: Removed 25 rows containing missing values (geom_point).
## Warning: Removed 25 rows containing missing values (geom_point).
## Warning: Removed 7625 rows containing missing values (geom_point).
ggplot(aes(x = (EmploymentStatusDuration%%12), y = BorrowerAPR), data = ld) +
geom_point()
## Warning: Removed 7625 rows containing missing values (geom_point).
## Warning: Removed 7625 rows containing missing values (geom_point).
## Warning: Removed 7625 rows containing missing values (geom_point).
## Warning: Removed 29084 rows containing missing values (geom_point).
## Warning: Removed 29084 rows containing missing values (geom_point).
## Warning: Removed 29103 rows containing missing values (geom_point).
## Warning: Removed 29084 rows containing missing values (geom_point).
** Term vs Borrower APR ANALYSIS**: There are only three values of term ie 12,36 and 60 months. However from this scatterplot we can arrive at the conclusion that there is a wider range of Borrower APR values for the 36 months term compared to the 60 and 12 month terms.
** Borrower APR vs LoanOriginalAmount ANALYSIS**: As the Loan amount increases the Borrower APR moves to a range between 0.1 to 0.2 from 0.05 to 0.4. There seems to be a negative corelation between Borrower APR and Loan Amount.
** Borrower APR vs EmploymentStatus duration**: We see that Employment status duration has no effect on the Borrower APR
** LoanOriginalAmount vs EmploymentStatus duration**: We see that Employment status duration has no effect on the Borrower APR
** Term vs LoanOriginalAmount**: We see that LoanOriginalAmount has no effect on the Term of the loan.
** ProsperScore vs LoanOriginalAmount**: We see that majority of loans having LoanOriginalAmounts in excess of 25000 have prosper score greater than 5.
** ProsperScore vs BorrowerAPR**: There does not seem to be any relation between the proper score and the borrower APR.
** ProsperScore vs EmploymentStatusDuration**: There are very few loans with a low prosper score of 1 and duartion greater than 400 days.
** ProsperScore vs Term**: There seems to be a consistent distribution of loans with the entire range of prosper scores across the three terms.
The relationship between BorrowedAPR and the LoanOriginalAmount is interesting and will be further investiagted in the next section. Using this relationship along with the relationships of ProsperScore vs BorrowerAPR and ProsperScore vs LoanOriginalAmount can help us derive some intelligence from the data.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
ld_by_p_score <- group_by(ld, ProsperScore)
ld.BorrowerAPR_by_P_Score_mean <- summarise(ld_by_p_score,
BorrowerAPR_mean = mean(BorrowerAPR),
BorrowerAPR_median = median(BorrowerAPR),
LoanAmount = mean(LoanOriginalAmount),
n = n())
ld.BorrowerAPR_by_P_Score_mean <- arrange(ld.BorrowerAPR_by_P_Score_mean, ProsperScore)
ggplot(aes(x = ProsperScore, y = LoanAmount), data = ld.BorrowerAPR_by_P_Score_mean)+
geom_line()
## Warning: Removed 1 rows containing missing values (geom_path).
ggplot(aes(x = ProsperScore, y = BorrowerAPR_mean), data = ld.BorrowerAPR_by_P_Score_mean)+
geom_line()
## Warning: Removed 1 rows containing missing values (geom_path).
Relationship between Mean Loan Amount and the ProsperScore: We want to find a realtionship between the mean Loan amount sanctioned and the ProsperScore. We see that there is an increasing linear relationship between the mean Loan Amount and the Prosper Score.
Relationship between Mean Borrower APR and the ProsperScore: We want to find a relationship between the Mean Borrower APR and the Prosper Score. To do this we group the loans in ld dataframe by Propser Score and the find the mean Borrower APR of each group. Plotting this we find a very close inverse linear relationship between ProsperScore and the mean Borrower APR.
## Warning: Removed 25 rows containing missing values (geom_point).
The Range of Borrower APR seems to clearly decrease as the loan amount increases Loans below 5K$ have an APR range from 0.05 to 0.4 whereas loans upward of 30K$ have an APR between 0.1 and 0.2
ggplot(aes(x = ProsperScore, y = LoanAmount), data = ld.BorrowerAPR_by_P_Score_mean)+
geom_line()
## Warning: Removed 1 rows containing missing values (geom_path).
The mean Loan Amount of loans grouped by ProsperScore has a clear approximate increasing linear relationship with the ProsperScore. It can be understood logically that highly rated loans have a larger Loan Amount than loans with poor rating.
ggplot(aes(x = ProsperScore, y = BorrowerAPR_mean), data = ld.BorrowerAPR_by_P_Score_mean)+
geom_line()
## Warning: Removed 1 rows containing missing values (geom_path).
As an inverse of the previous graph we can see here that the APR mean falls with increasing Prosper Score. Again this can be understood in a way that the loans with higher Prosper SCore have a mean APR lower than the loans with poorer Prosper Scores.
It was a difficult process to analyze this dataset. The intial difficulty lay in trying to guess the interesting variables for analysis from a total of 81 variables. The distributions of the variables helped understand the nature of the data and the distribution which in turn help guess the pair of variables that could seem to have some relationship between them. After narrowing down to the interesting relationship between Borrowed APR and Loan Amount it was decided to explore this relationship further using the ProsperScore variable. Finally we succeeded in establishing that both the Loan Amount and Borrowed APR varied in a linear fashion compared to the ProsperScore variable.